Term Count on Search Results #3920

Closed
nik9000 opened this issue Oct 16, 2013 · 11 comments

@nik9000
Member

nik9000 commented Oct 16, 2013

Would anyone else be interested in getting Elasticsearch to return a count of the terms in a field in the search results? If you (like me) need to return a word count of a field, this could be useful to you. I could also return a count of distinct terms, but I'm not sure who'd use it. I was thinking the API could be something like this:

curl -XPOST "http://localhost:9200/test/test/_search?pretty" -d '{
  "fields": [ "foo._term_count" ],
  "query": {
    "query_string": {
      "query": "findme"
    }
  }
}'

And it'd return "foo._term_count" : 6, in the results.

It'd require term_vectors to be stored, but not offsets or positions. Since it'd count the terms on each search result, it'd be similar to highlighting with the FVH (Fast Vector Highlighter), but faster, because it does essentially no work other than scanning the term vector.

I don't imagine you'd be able to sort by them.
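
To illustrate the counting described above: a term vector for one field of one document is essentially a map from each term to its frequency in that field, so the word count is the sum of the frequencies and the distinct-term count is the number of entries. Neither needs offsets or positions. A minimal sketch in Python, using a plain `Counter` as a stand-in for the stored term vector (the function names and sample tokens are illustrative assumptions, not Elasticsearch or Lucene APIs):

```python
from collections import Counter


def build_term_vector(tokens):
    """Stand-in for a stored term vector: term -> frequency in one field."""
    return Counter(tokens)


def term_count(term_vector):
    """Total number of terms in the field (the word count)."""
    return sum(term_vector.values())


def distinct_term_count(term_vector):
    """Number of unique terms in the field."""
    return len(term_vector)


# Hypothetical analyzed content of field "foo" for one search hit.
vector = build_term_vector(["findme", "in", "this", "field", "findme", "please"])
print(term_count(vector))           # 6
print(distinct_term_count(vector))  # 5
```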

@brwe
Contributor

brwe commented Oct 17, 2013

Might this requirement be similar to #3924 ? Also I am curious: What is your use case?

@nik9000
Member Author

nik9000 commented Oct 17, 2013

Sorry I wasn't clear. On my search results page I have to return a word count of one of the fields for every search result. It happens to be my longest field. And it has to support scriptio continua languages, so I can't do something simple like count the number of spaces in my app and save that number to ES to retrieve with the search results. Anyway, Elasticsearch has a word count already, in the form of the per-field, per-document term vectors that I already store in order to use the FVH. Also, luckily, I process that field with an analyzer that doesn't add synonyms or funky word breaks. If I can ask Elasticsearch to count the terms in that field, that'll give me my word count.
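
A small Python illustration of why counting spaces in the application breaks down for scriptio continua text. The sample strings and the function name are made up for illustration; a usable count has to come from a language-aware analyzer, which is what the stored term vectors already reflect.

```python
def naive_word_count(text):
    """Count words by splitting on whitespace."""
    return len(text.split())


english = "the quick brown fox"
# Japanese sentence written without spaces between words.
japanese = "私は検索エンジンが好きです"

print(naive_word_count(english))   # 4
print(naive_word_count(japanese))  # 1, although the sentence has several words
```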

Anyway, what I need is pretty simple in comparison to the term vector API. I won't be listing terms, and I only want term information for a single document. I also want it bundled in the search results so I don't have to make any additional requests.

I'll send a pull request that implements this today or tomorrow, which should make it crystal clear.

@s1monw
Contributor

s1monw commented Oct 17, 2013

Just for kicks, can you build a custom analyzer that consumes all tokens and returns the number of tokens in the field as a token, and then sort by it? You would need to parse the string, but it would work, no?

@synhershko
Contributor

+1. Use cases can include faceting, scripted scoring, record linkage and whatnot.

@s1monw all that is required is a custom TokenFilter really, but that filter doesn't have access to the IW / Document object, so you will need to go through the analysis chain twice.
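
The idea in the two comments above can be modelled outside of Lucene. The Python below is only a sketch: `analyze`, `CountingFilter` and `index_document` are hypothetical names, and a lowercase-and-split tokenizer stands in for the real analysis chain. It shows why a second pass is needed: the count is only known once the token stream is exhausted, and the filter itself has no handle on the document to write the count into.

```python
def analyze(text):
    """Stand-in analysis chain: lowercase and split on whitespace."""
    for token in text.lower().split():
        yield token


class CountingFilter:
    """Passes tokens through unchanged while counting them."""

    def __init__(self, stream):
        self.stream = stream
        self.count = 0

    def __iter__(self):
        for token in self.stream:
            self.count += 1
            yield token


def index_document(doc, field, count_field):
    # Pass 1: run the analysis chain only to learn the term count.
    counter = CountingFilter(analyze(doc[field]))
    for _ in counter:
        pass
    enriched = dict(doc)
    enriched[count_field] = counter.count
    # Pass 2: analyze again to produce the tokens that get indexed.
    indexed_tokens = list(analyze(doc[field]))
    return enriched, indexed_tokens


doc, tokens = index_document({"foo": "Find me in this field"}, "foo", "foo_term_count")
print(doc["foo_term_count"])  # 5
print(tokens)                 # ['find', 'me', 'in', 'this', 'field']
```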

@nik9000
Member Author

nik9000 commented Oct 17, 2013

record linkage

Sorry, what do you mean?

custom TokenFilter

I like this idea. In that case it'd make sense to build the field in the mapping, maybe like this:

curl -XPUT "http://localhost:9200/test/test/_mapping?pretty" -d '{
  "test" : {
    "properties": {
      "foo" : {
        "type": "string",
        "store": "yes",
        "write_term_count" : "foo_term_count"
      },
      "foo_term_count" : {
        "type": "integer",
        "store": "yes"
      }
    }
  }
}'

It'd be a pain to have to use the custom analyzer and analyze everything twice but it'd be worth it if it enables lots of fun features. I'll have a look later today I think.
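
As a rough picture of how the proposed mapping would be read at index time, here is a Python sketch. `write_term_count` is the option proposed in this comment, not an existing Elasticsearch setting, and `apply_term_counts` with its whitespace-splitting default analyzer is a hypothetical helper used only for illustration.

```python
mapping = {
    "properties": {
        "foo": {"type": "string", "store": "yes", "write_term_count": "foo_term_count"},
        "foo_term_count": {"type": "integer", "store": "yes"},
    }
}


def apply_term_counts(mapping, doc, analyze=str.split):
    """Return a copy of doc with the mapped term-count fields filled in."""
    result = dict(doc)
    for name, props in mapping["properties"].items():
        target = props.get("write_term_count")
        # Only fields that opt in, and that are present in the document, are counted.
        if target is not None and name in doc:
            result[target] = len(analyze(doc[name]))
    return result


print(apply_term_counts(mapping, {"foo": "one two three"}))
# {'foo': 'one two three', 'foo_term_count': 3}
```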

@synhershko
Contributor

Record linkage is whenever you want to find similar documents, and word count can be a good hint for that.

@javanna
Member

javanna commented Oct 18, 2013

This other issue looks similar to what was asked here, although it proposes a separate API for it: #640.

@synhershko
Contributor

I think someone is confusing Word-Count in a field of a specific document with Term Count of all documents in a field. Not sure who that is, though :)

@javanna
Member

javanna commented Oct 19, 2013

Indeed, that other issue is a completely different story; I should have read more carefully. Thanks for clarifying that, @synhershko.

@nik9000
Member Author

nik9000 commented Oct 21, 2013

I got this working today. I'll send a pull request for it as soon as it passes all of its tests. GitHub has helpfully created a link to my implementation above for anyone curious. The unit test covers returning the count in the search results, searching for it via a range query, and faceting. It covers counting both single and multi-valued fields, both on the root and inside of an object. For multi-valued fields it writes multiple term counts, one per value; it doesn't add them together.
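
The multi-valued behaviour described above can be sketched as follows. This is not the code from the pull request: `term_counts` and `matches_range` are hypothetical helpers, whitespace splitting stands in for the analyzer, and the range check assumes the usual behaviour where a multi-valued numeric field matches when any one of its values falls inside the range.

```python
def term_counts(values, analyze=str.split):
    """One count per value; the counts are not summed."""
    if isinstance(values, str):
        values = [values]
    return [len(analyze(v)) for v in values]


def matches_range(counts, gte, lte):
    """True if any single count lies in [gte, lte]."""
    return any(gte <= c <= lte for c in counts)


counts = term_counts(["one two three", "four five"])
print(counts)                       # [3, 2]
print(matches_range(counts, 3, 5))  # True
print(matches_range(counts, 4, 5))  # False: the sum is 5, but no single value has 4 or 5 terms
```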

@nik9000
Member Author

nik9000 commented Oct 21, 2013

Also, while I think about it I'm pretty sure I did a few things wrong and would love some tips on the right way:

  1. I created a new package for the implementation. I'm sure there is some existing place where it belongs.
  2. I'm a bit hacky in the way that I override the Long field implementation, and in the way I use that field regardless of what field type the term counts are actually mapped to. Now that I think about it, I didn't do a lot of testing around mapping the term count to things other than long. I mean, it won't work at all if you map it to something non-numeric, but short and int and the like should work as expected.
  3. Everything else I haven't thought of :)
